Problem Set 4 - Simple Linear Regression

Instructions

Complete the following exercises based on ModernDive Chapter 5 - Simple Linear Regression. Before beginning, create a new R Markdown document and give it a YAML header that includes the title “HPAM 7660 Problem Set 4”, your name, the date, and “pdf_document” as the output format.
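For reference, a header meeting these requirements might look like the sketch below. The author and date values are placeholders to replace with your own; the title and output format come from the instructions above.

```yaml
---
title: "HPAM 7660 Problem Set 4"
author: "Your Name"    # placeholder: replace with your name
date: "YYYY-MM-DD"     # placeholder: replace with the date
output: pdf_document
---
```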

As you answer each of the following questions, be sure to include your R code and associated output in your R Markdown document. Additionally, add a line or two describing what you’re doing in each code chunk.

Steps for Completing the Assignment

  1. Install and load the moderndive package. This package contains the datasets and helper functions we’ll be using throughout this problem set. You’ll also want to load the dplyr package.

  2. We’ll start by exploring the un_member_states_2024 dataset, which contains data on 181 UN member states and includes variables for life expectancy (life_expectancy_2022), fertility rate (fertility_rate_2022), and obesity rate (obesity_rate_2024). Use the glimpse() function to preview the dataset and then use tidy_summary() to generate summary statistics for all variables. How would you describe the typical life expectancy and fertility rate across countries in the data?

  3. Now let’s fit a simple linear regression model with fertility_rate_2022 as the outcome and life_expectancy_2022 as the explanatory variable. Use the lm() function to fit the model and the coef() function to display the estimated coefficients. Write out the regression equation using the estimated values of the intercept and slope.

  4. Interpret the slope coefficient from your regression. In practical terms, what does a one-year increase in life expectancy imply for a country’s fertility rate? Be sure to use the word “associated” in your answer and explain why we use that language rather than saying life expectancy causes changes in fertility.

  5. Use the get_regression_points() function to generate a data frame of observed values, fitted values, and residuals. Does the United States have a higher or lower fertility rate than the model predicts? What does that mean in practical terms?

  6. Now let’s shift to examining a categorical explanatory variable. Fit a linear regression model with life_expectancy_2022 as the outcome and continent as the explanatory variable. Display the coefficients. Which continent serves as the baseline for comparison, and how do you know? What is the model’s predicted life expectancy for countries in Europe?

  7. Interpret the coefficient on continentAsia. What does this value tell you about life expectancy in Asian countries relative to the baseline group? Is Asia above or below the baseline, and by how much?

  8. Use get_regression_points() with the ID = "country" argument to retrieve the fitted values and residuals for the continent model. Identify the five countries with the most negative residuals. What do large negative residuals tell us about these countries relative to others in their continent?

  9. Section 5.3.1 of ModernDive discusses the important distinction between correlation and causation. In your own words, what is a confounding variable, and why does its presence make it difficult to draw causal conclusions from a simple regression? Use the life expectancy and fertility rate example from the chapter to illustrate your answer.

  10. Once you’ve finished Step 9, knit your PDF document and upload it to the Problem Set 4 assignment link on Canvas, and you’re done!
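
If the mechanics in Steps 3 through 8 are unfamiliar, the following is a minimal sketch of the same workflow on a different dataset. It deliberately uses R’s built-in mtcars data rather than un_member_states_2024, so the variable names (mpg, wt, cyl) and the object names (model_num, points_num, mtcars_cat, model_cat) are illustrative assumptions and none of the output answers the questions above. The residual table is built with base R so the sketch runs without extra packages; the final lines call get_regression_points() only if moderndive is installed.

```r
# Illustrative only: built-in mtcars data, not the assignment dataset.

# Numerical explanatory variable: regress mpg (outcome) on wt.
model_num <- lm(mpg ~ wt, data = mtcars)
coef(model_num)  # estimated intercept and slope

# Observed values, fitted values, and residuals, assembled with base R.
points_num <- data.frame(
  car      = rownames(mtcars),
  mpg      = mtcars$mpg,
  mpg_hat  = fitted(model_num),
  residual = residuals(model_num)
)

# The five rows with the most negative residuals, i.e. the observations
# that fall furthest below their fitted values.
head(points_num[order(points_num$residual), ], 5)

# Categorical explanatory variable: treat cyl as a factor.
# The first factor level ("4") is the baseline; the remaining
# coefficients are offsets from the baseline group's mean.
mtcars_cat <- transform(mtcars, cyl = factor(cyl))
model_cat  <- lm(mpg ~ cyl, data = mtcars_cat)
coef(model_cat)

# Predicted mean for a non-baseline group = intercept + that group's offset.
coef(model_cat)["(Intercept)"] + coef(model_cat)["cyl6"]

# With moderndive installed, get_regression_points() returns a comparable
# table of observed values, fitted values, and residuals.
if (requireNamespace("moderndive", quietly = TRUE)) {
  print(moderndive::get_regression_points(model_num))
}
```

The same pattern applies to the assignment data once you substitute your own outcome, explanatory variable, and data frame in the lm() calls.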